Skip to content

fix: prevent temp file leak and race condition in save_formatted_audio#1905

Closed
sajisanchu1913-source wants to merge 10 commits into
microsoft:mainfrom
sajisanchu1913-source:fix/save-formatted-audio-temp-file-leak
Closed

fix: prevent temp file leak and race condition in save_formatted_audio#1905
sajisanchu1913-source wants to merge 10 commits into
microsoft:mainfrom
sajisanchu1913-source:fix/save-formatted-audio-temp-file-leak

Conversation

@sajisanchu1913-source
Copy link
Copy Markdown
Contributor

Problem

DataTypeSerializer.save_formatted_audio had two bugs in the Azure storage branch:

  1. File leak: os.remove(local_temp_path) was not in a finally block,
    so if the Azure upload raised an exception, temp_audio.wav was never deleted.

  2. Race condition: The temp file always used the fixed name temp_audio.wav,
    so concurrent calls would clobber each other's WAV before upload.

Fix

  • Replaced fixed temp_audio.wav with tempfile.NamedTemporaryFile to give
    each call a unique path
  • Wrapped the upload block in try/finally so the temp file is always deleted

Tests

  • Added regression test that mocks Azure upload to raise, then asserts
    no new .wav files remain in DB_DATA_PATH

Fixes #1894

sajisanchu1913-source and others added 10 commits May 28, 2026 17:14
- _RemoteDatasetLoader._fetch_zip_from_url:
  - keyword-only args (source, inner_files, cache)
  - streams download (requests stream=True + iter_content) to avoid
    double-buffering large archives
  - md5-keyed disk cache under DB_DATA_PATH / seed-prompt-entries when
    cache=True; named temp file otherwise (cleaned up after parse)
  - validates each inner_files extension against FILE_TYPE_HANDLERS;
    raises ValueError with a member preview if an inner file is missing
  - parses inner files via FILE_TYPE_HANDLERS and returns parsed dicts,
    so the open ZipFile never escapes the worker thread
  - adds the missing import zipfile that broke the previous commit
- _MICDataset:
  - drops unused io / json / requests imports (helper handles them)
  - delegates download + parse to the helper; only owns the seed
    construction loop
  - guards non-string Q values (in addition to NaN moral values)
  - forwards cache from fetch_dataset_async to the helper
  - factors authors into AUTHORS class constant
- Tests:
  - test_moral_integrity_corpus_dataset.py: stops mocking requests.get
    directly; patches _fetch_zip_from_url to return parsed dicts so
    tests don't depend on the helper's internal shape
  - adds test_fetch_dataset_non_string_q and
    test_fetch_dataset_passes_cache_flag
  - hoists imports into the right groups so ruff I001 stops firing
  - removes trailing whitespace / extra newlines
- test_remote_dataset_loader.py: adds TestFetchZipFromUrl covering
  happy path, on-disk caching (hits 1 network call across 2 fetches),
  cache=False does not persist, missing inner file raises ValueError,
  unsupported extension raises ValueError

Verified live against the real MIC.zip: 35,408 unique seeds across
all 6 moral foundations in ~2.4s cold / ~1.3s warm. All 559 dataset
unit tests pass; ruff clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use tempfile.NamedTemporaryFile instead of fixed temp_audio.wav
  to prevent concurrent call collisions
- Wrap Azure upload in try/finally to ensure temp file is always
  deleted even when upload fails
- Add regression test to verify cleanup on upload failure

Fixes microsoft#1894
@sajisanchu1913-source
Copy link
Copy Markdown
Contributor Author

Closing in favor of cleaner PR with single focused commit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: save_formatted_audio leaks temp_audio.wav on Azure upload failure

2 participants